Sexual dimorphism study Differential Expression

Doing the differential expression analysis.

    library(RnaSeqTutorial)
    samples <- read.csv(file.path(extdata(),"sex-samples.csv"))
    res <- mclapply(dir(file.path(extdata(),"htseq"),
                    pattern="^[2,3].*_STAR\\.txt",
                    full.names=TRUE),function(fil){
      read.delim(fil,header=FALSE,stringsAsFactors=FALSE)
    },mc.cores=2)
    names(res) <- gsub("_.*_STAR\\.txt","",dir(file.path(extdata(),"htseq"),
                                           pattern="^[2,3].*_STAR\\.txt"))
    addInfo <- c("__no_feature","__ambiguous",
             "__too_low_aQual","__not_aligned",
             "__alignment_not_unique")
    sel <- match(addInfo,res[[1]][,1])
    count.table <- do.call(cbind,lapply(res,"[",2))[-sel,]
    colnames(count.table) <- names(res)
    rownames(count.table) <- res[[1]][,1][-sel]
    conditions <- names(res)
    sex <- samples$sex[match(names(res),samples$sample)]
    date <- factor(samples$date[match(names(res),samples$sample)])
    pal=brewer.pal(8,"Dark2")

The QA revealed that we have a confounding factor. Luckily, enough the grouping of sex is balanced between sampling years and hence we will be able to block the year effect to investigate the sex effect.

yearSexDesign <- samples[grep("[2,3].*",samples$sample),c("sex","date")]

DESeq

The primary reason to use DESeq over DESeq2 or edgeR or any other is that DESeq is very conservative. The second reason is that DESeq give less weight to outliers than DESeq2 does and based on the study by Pakull et al., which identify a candidate gene for sexual determination - the same we simultaneously, independantly originally identified in our analyses - it seems that our sample 226.1 may be mis-classified. However, the sex phenotyping has been done thoroughly, so it may also be that the sex determination is a more complex trait and that Potri.019G047300 is only one factor influencing it As introduced we want to look at the sex by blocking the year effect. We start by estimating the size factors and the dispersion

cdsFull <- newCountDataSet(count.table,yearSexDesign)
cdsFull = estimateSizeFactors( cdsFull )
cdsFull = estimateDispersions( cdsFull )

We can now check the dispersion estimated by DESeq, which is, obviously, conservative. I.e. most dispersion values lie below the actual dispersion fit by about 1 fold.

plotDispLSD(cdsFull)

Next we create both models (one considering the date only and on the date and sex)

fit1 = suppressWarnings(fitNbinomGLMs( cdsFull, count ~ date + sex ))
## ...............................
fit0 = suppressWarnings(fitNbinomGLMs( cdsFull, count ~ date))
## ...............................

For the rest of the analysis, we ignore the genes that did not converge in the previous step

sel <- !is.na(fit1$converged) & !is.na(fit0$converged) & fit1$converged & fit0$converged
fit1 <- fit1[sel,]
fit0 <- fit0[sel,]

We next calculate the GLM p-value and adjust them for multiple testing

pvalsGLM = nbinomGLMTest( fit1, fit0 )
padjGLM = p.adjust( pvalsGLM, method="BH" )

We finally visualize the obtained results. A first insight shows that a number of genes have very high fold changes, which is due the fact that these genes have very few read in very few samples. For clarity, in the following plots, we filter those genes with a too high FC.

boxplot(rowSums(count.table[rownames(fit1[fit1$sexM > 20,]),]>0),
        ylab="number of sample with expression > 0",
        main="distribution of the # of samples with\nexpression > 0 for genes with extreme high log2FC"
)

boxplot(rowSums(count.table[rownames(fit1[fit1$sexM < -20,]),]>0),
        ylab="number of sample with expression > 0",
        main="distribution of the # of samples with\nexpression > 0 for genes with extreme low log2FC"
)

The vast majority of these genes are anyway not significant.

plot(density(padjGLM[fit1$sexM > 20]),
     main="Adj. p-value for genes with extreme log2FC",
     xlab="Adj. p-value")

plot(density(padjGLM[fit1$sexM > -20]),
     main="Adj. p-value for genes with extreme log2FC",
     xlab="Adj. p-value")

So we filter them out

sel <- fit1$sexM > -20 & fit1$sexM < 20
fit1 <- fit1[sel,]
padjGLM <- padjGLM[sel]
pvalsGLM <- pvalsGLM[sel]

A further look into the data reveals that one p-value is equal to 0. As this is inadequate for plotting log odds (-log of the p-value), we change these 0s to be 1 log10 fold change smaller than the otherwise smallest p-value (for plotting purposes only!)

padjGLM[padjGLM==0] <- min(padjGLM[padjGLM!=0])/10
pvalsGLM[pvalsGLM==0] <- min(pvalsGLM[pvalsGLM!=0])/10

Visualize DESeq results

As a volcano plot with the non-adjusted p-values The red dots circles are the 4 most significantly differentially expressed genes, and the 8 with the absolute highest log2 fold changes (4 positive and 4 negative)

heatscatter(fit1$sexM,-log10(pvalsGLM),
            main="Male vs. Female Differential Expression",cor=FALSE,
            xlab="Log2 Fold Change", ylab="- log(10) p-value")
legend("topleft",bty="n","1% p-value cutoff",lty=2,col="gray")
points(fit1[pvalsGLM<.01,"sexM"],
       -log10(pvalsGLM[pvalsGLM<.01]),
       col="lightblue",pch=19)
pos <- c(order(pvalsGLM)[1:4],
         order(fit1$sexM,decreasing=TRUE)[1:4],
         order(fit1$sexM)[1:4])
points(fit1[pos,"sexM"],-log10(pvalsGLM[pos]),col="red")
abline(h=2,lty=2,col="gray")

As a volcano plot with adjusted p-values

heatscatter(fit1$sexM,-log10(padjGLM),
            main="Male vs. Female Differential Expression",cor=FALSE,
            xlab="Log2 Fold Change", ylab="- log(10) adj. p-value")
legend("topleft",bty="n","1% FDR cutoff",lty=2,col="gray")
points(fit1[padjGLM<.01,"sexM"],-log10(padjGLM[padjGLM<.01]),col="lightblue",pch=19)
pos <- c(order(padjGLM)[1:4],order(fit1$sexM,decreasing=TRUE)[1:4],order(fit1$sexM)[1:4])
points(fit1[pos,"sexM"],-log10(padjGLM[pos]),col="red")
abline(h=2,lty=2,col="gray")

As DESeq is an “older” implementation of this type of analysis, we will redo the analysis using DESeq2. To ultimately be able to compare the obtained set of differential expression candidate genes, we save the list of genes significant at a 1% adjusted p-value cutoff.

UmAspD <- rownames(fit1[padjGLM<.01,])

which are not so many (one)

UmAspD
## [1] "Potra000503g03273.0"



Time for Socrative question number 18



DESeq2 differential expression analyses

We redo the analyses performed above. As you will see, it has been made more intuitive in DESeq2 and the same can be achieved in fewer commands.

dds <- DESeqDataSetFromMatrix(
    countData = count.table,
    colData = data.frame(condition=conditions,
                       sex=sex,
                       date=date),
    design = ~date+sex)
dds <- DESeq(dds)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
plotDispEsts(dds)

res <- results(dds,contrast=c("sex","M","F"))

In 4 commands we have reproduced the analysis and we have observed that the dispersion estimation looked different. DESeq2 introduces a so called “shrinkage”, which you can learn more about in the DESeq2 vignettes. We, next, extract the results using the same 1% cutoff and plot similar validation plots

alpha=0.01
plotMA(res,alpha=alpha)

volcanoPlot(res,alpha=alpha)

## NULL

The volcano plot look similar to what we observed previously (except that we have by mistake inverted the ratio calculation, we now did F-M), although slightly less dead that in DESeq. We also do not get any genes with a p-value of 0

hist(res$padj,breaks=seq(0,1,.01))

sum(res$padj<alpha,na.rm=TRUE)
## [1] 3

4 genes are candidates for differential expression at a 1% adjusted p-value cutoff, which we also save for later comparison

UmAspD2 <- rownames(res[order(res$padj,na.last=TRUE),][1:sum(res$padj<alpha,na.rm=TRUE),])
UmAspD2
## [1] "Potra003910g23469.0" "Potra001414g11999.0" "Potra004533g25038.0"

We can already observe that one gene is common with the previous list, but that the chromosome 19 candidate has disappeared. This is surprising, given the Pakull et al. publication, that Potri.019G047300.0 (the homolog of Potra000503g03273) does not come out anymore as a differentially expressed candidate gene at a 1% FDR cutoff with DESeq2. But it’s adjusted p-value is 2.1% - see below. Moreover it’s cook distance - a measure of the homogeneity of the gene expression within a condition - is relatively high (2x the average) and might indicate the presence of an outlier. Looking into more details at the count values and sex assignment shows that the sample 226.1 behaves like a male sample with regards to that gene, so either the sample has been mis-sexed or the Pakull et al. results in P. tremuloides and P.tremula are only a partial view of the true biology.

as.data.frame(res)["Potra000503g03273.0",]
##                     baseMean log2FoldChange lfcSE stat  pvalue padj
## Potra000503g03273.0      125            1.1  0.24  4.6 3.9e-06 0.02
data.frame(
  counts=counts(dds,normalized=TRUE)["Potra000503g03273.0",],
  sex,
  date,
  conditions)
##       counts sex date conditions
## 202     7.64   F 2008        202
## 207   145.27   M 2010        207
## 213.1   0.00   F 2008      213.1
## 221   261.99   M 2008        221
## 226.1  23.02   F 2010      226.1
## 229.1 305.21   M 2008      229.1
## 229    79.92   M 2010        229
## 235    91.24   M 2010        235
## 236     0.92   F 2010        236
## 239     0.00   F 2010        239
## 244     2.22   F 2010        244
## 303   476.78   M 2008        303
## 305.3  87.38   F 2008      305.3
## 309.1 237.93   M 2008      309.1
## 310.3  49.12   F 2008      310.3
## 337.1 349.03   M 2008      337.1
## 349.2   5.15   F 2008      349.2
mcols(dds)[names(rowRanges(dds))=="Potra000503g03273.0",]
## DataFrame with 1 row and 36 columns
##    baseMean   baseVar   allZero dispGeneEst   dispFit dispersion  dispIter
##   <numeric> <numeric> <logical>   <numeric> <numeric>  <numeric> <numeric>
## 1       125     21809     FALSE         1.3      0.13       0.99        10
##   dispOutlier   dispMAP Intercept  date2008  date2010      sexF      sexM
##     <logical> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## 1       FALSE      0.99       6.2      0.64     -0.64     -0.56      0.56
##   SE_Intercept SE_date2008 SE_date2010   SE_sexF   SE_sexM MLE_Intercept
##      <numeric>   <numeric>   <numeric> <numeric> <numeric>     <numeric>
## 1         0.35        0.25        0.25      0.12      0.12           4.8
##   MLE_date_2010_vs_2008 MLE_sex_M_vs_F WaldStatistic_Intercept
##               <numeric>      <numeric>               <numeric>
## 1                  -1.9            3.7                      18
##   WaldStatistic_date2008 WaldStatistic_date2010 WaldStatistic_sexF
##                <numeric>              <numeric>          <numeric>
## 1                    2.6                   -2.6               -4.6
##   WaldStatistic_sexM WaldPvalue_Intercept WaldPvalue_date2008
##            <numeric>            <numeric>           <numeric>
## 1                4.6              4.7e-69              0.0099
##   WaldPvalue_date2010 WaldPvalue_sexF WaldPvalue_sexM  betaConv  betaIter
##             <numeric>       <numeric>       <numeric> <logical> <numeric>
## 1              0.0099         3.9e-06         3.9e-06      TRUE        18
##    deviance  maxCooks
##   <numeric> <numeric>
## 1       183      0.77



Time for Socrative question number 19



DESeq2 with the sample re-sexed

Nevertheless, while waiting for the next time the tree flowers and we can confirm its sex with certainty, we can assume that the sample was mis-sexed and see what effect a sex correction would have. Given the data from Pakull et al. about that gene; assuming that the sample 226.1 was mis-sexed, we redo the DESeq2 analysis after swaping the sex of that sample

sex[conditions=="226.1"] <- "M"
dds <- DESeqDataSetFromMatrix(
  countData = count.table,
  colData = data.frame(condition=conditions,
                       sex=sex,
                       date=date),
  design = ~ date+sex)
dds <- DESeq(dds)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
res.swap <- results(dds,contrast=c("sex","M","F"))
plotMA(res.swap,alpha=alpha)

volcanoPlot(res.swap,alpha=alpha)

## NULL
hist(res.swap$padj,breaks=seq(0,1,.01))

sum(res.swap$padj<alpha,na.rm=TRUE)
## [1] 7
UmAspD2.swap <- rownames(res.swap[order(res.swap$padj,na.last=TRUE),][1:sum(res.swap$padj<alpha,na.rm=TRUE),])
UmAspD2.swap
## [1] "Potra000503g03273.0" "Potra001218g10512.0" "Potra001519g12651.0"
## [4] "Potra003910g23469.0" "Potra005027g34078.0" "Potra181824g28189.0"
## [7] "Potra002176g16783.0"

Fair enough, the gene made it back in the list.

DESeq2 analysis - sample removed

But, still, to assess the importance of the 226.1 sample, we redo the DESeq2 DE analysis without it (since we have enough replicate for that removal not to affect the power of our analysis)

sel <- conditions != "226.1"
dds <- DESeqDataSetFromMatrix(
  countData = count.table[,sel],
  colData = data.frame(condition=conditions[sel],
                       sex=sex[sel],
                       date=date[sel]),
  design = ~ date+sex)
dds <- DESeq(dds)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
plotDispEsts(dds)

res.sel <- results(dds,contrast=c("sex","M","F"))
plotMA(res.sel,alpha=alpha)

volcanoPlot(res.sel,alpha=alpha)

## NULL
hist(res.sel$padj,breaks=seq(0,1,.01))

sum(res.sel$padj<alpha,na.rm=TRUE)
## [1] 8
UmAspD2.sel <- rownames(res.sel[order(res.sel$padj,na.last=TRUE),][1:sum(res.sel$padj<alpha,na.rm=TRUE),])
UmAspD2.sel
## [1] "Potra003910g23469.0" "Potra005027g34078.0" "Potra000503g03273.0"
## [4] "Potra001414g11999.0" "Potra001519g12651.0" "Potra004533g25038.0"
## [7] "Potra001218g10512.0" "Potra178727g27972.0"

Results comparison

We compare the results from the 4 approaches, the easiest way being to plot a Vann Diagram

plot.new()
grid.draw(venn.diagram(list(
  UmAsp=UmAspD2,
  UmAsp.swap=UmAspD2.swap,
  UmAsp.sel=UmAspD2.sel,
  UmAspD = UmAspD),
                       filename=NULL,
                       col=pal[1:4],
                       category.names=c("UmAsp - DESeq2",
                                        "UmAsp - corrected",
                                        "UmAsp - removed",
                                        "UmAsp - DESeq")))

Unlike in the original analysis conducted in the study, where the Populus trichocarpa genome was used as a reference (the P. tremula genome assembly is still a draft assembly), where the gene Potri.014G155300.0 was constantly found by all 4 approaches and the gene Potri.019G047300.0 (Potra000503g03273.0 homolog) was found by DESeq and DESeq2, there is no overlap between both native methods. This is probably because the result in P. tremula are obscured by an assembly artefact. Indeed we know that the two genes in P. trichocarpa have a very high sequence similarity (being putatively very recent paralogs), which appears to have been collapsed into a single locus in P. tremula. We can also note that “messing” with the sample sex attribute or removing that sample leads to the identification of several more genes! It is unclear if these are the results of a technical error or if they would be worth investigating further.

sort(intersect(UmAspD2,UmAspD))
## character(0)
sort(intersect(UmAspD2.swap,UmAspD))
## [1] "Potra000503g03273.0"
sort(intersect(UmAspD2.sel,UmAspD))
## [1] "Potra000503g03273.0"



Time for Socrative question number 20



Appendix

Analyses performed when asking Mike Love (DESeq2 developer) why DESeq2 seem so sensitive to misclassification The short answer was to use the Cook distance to further investigate that and that the 1% FDR cutoff was conservative. These are commented out on purpose. Note that the rowData function has been replaced by rowRanges in Bioconductor version 3.1 and above.

## mis-classified
# sex <- samples$sex[match(colnames(count.table),samples$sample)]
# date <- factor(samples$date[match(colnames(count.table),samples$sample)])
# ddsM <- DESeqDataSetFromMatrix(
#   countData = count.table,
#   colData = data.frame(condition=conditions,
#                        sex=sex,
#                        date=date),
#   design = ~ date+sex)
# ddsM <- DESeq(ddsM)
# resM <- results(ddsM,contrast=c("sex","F","M"))
# 
# counts(ddsM,normalized=TRUE)["Potri.019G047300.0",]
# resM["Potri.019G047300.0",]
# mcols(ddsM)[names(rowData(ddsM)) == "Potri.019G047300.0",]
# 
# ## reclassified
# sexR <- sex
# sexR[5] <- "M"
# ddsR <- DESeqDataSetFromMatrix(
#   countData = count.table,
#   colData = data.frame(condition=conditions,
#                        sex=sexR,
#                        date=date),
#   design = ~ date+sex)
# ddsR <- DESeq(ddsR)
# resR <- results(ddsR,contrast=c("sex","F","M"))
# 
# counts(ddsR,normalized=TRUE)["Potri.019G047300.0",]
# resR["Potri.019G047300.0",]
# mcols(ddsR)[names(rowData(ddsR)) == "Potri.019G047300.0",]
# 
# ## samples details
# sex
# date
# 
# ## the model is not be affected by the reclassification
# lapply(split(sex,date),table)
# lapply(split(sexR,date),table)
# 

1 References